The Brain Tumor Segmentation - Metastases (BraTS-METS) Challenge 2023: Brain Metastasis Segmentation on Pre-treatment MRI

The translation of AI-generated brain metastases (BM) segmentation into clinical practice relies heavily on diverse, high-quality annotated medical imaging datasets. The BraTS-METS 2023 challenge has gained momentum for testing and benchmarking algorithms using rigorously annotated internationally compiled real-world datasets. This study presents the results of the segmentation challenge and characterizes the challenging cases that impacted the performance of the winning algorithms. Untreated brain metastases on standard anatomic MRI sequences (T1, T2, FLAIR, T1PG) from eight contributed international datasets were annotated in a stepwise manner: published UNET algorithms, student, neuroradiologist, final approver neuroradiologist. Segmentations were ranked based on lesion-wise Dice and Hausdorff distance (HD95) scores. False positives (FP) and false negatives (FN) were rigorously penalized, receiving a score of 0 for Dice and a fixed penalty of 374 for HD95. The mean scores for the teams were calculated. Eight datasets comprising 1303 studies were annotated, with 402 studies (3076 lesions) released on Synapse as publicly available datasets to challenge competitors. Additionally, 31 studies (139 lesions) were held out for validation, and 59 studies (218 lesions) were used for testing. Segmentation accuracy was measured as rank across subjects, with the winning team achieving a LesionWise mean score of 7.9. The Dice score for the winning team was 0.65 ± 0.25. Common errors among the leading teams included false negatives for small lesions and misregistration of masks in space. The Dice scores and lesion detection rates of all algorithms diminished with decreasing tumor size, particularly for tumors smaller than 100 mm³. In conclusion, algorithms for BM segmentation require further refinement to balance high sensitivity in lesion detection with the minimization of false positives and negatives.
The BraTS-METS 2023 challenge successfully curated well-annotated, diverse datasets and identified common errors, facilitating the translation of BM segmentation across varied clinical environments and providing personalized volumetric reports to patients undergoing BM treatment.


Brain metastases represent the most common malignancy affecting the adult central nervous system (Le Rhun et al., 2021), occurring in an estimated 20-40% of patients with systemic cancer (Percy et al., 1972; Tabouret et al., 2012; Posner, 1978; Nayak et al., 2012). Patients commonly have multiple lesions at different stages of treatment; therefore, radiologic evaluation often extends beyond a mere comparison with the most recent scan. In clinical practice, a comprehensive assessment frequently involves reviewing several previous scans to monitor the progression or changes in the metastases over time, which can be laborious and time-consuming (Jekel et al., 2022b; Kaur et al., 2023; Cassinelli Petersen et al., 2022).
The shift toward automated volumetric analysis and lesion organization in evaluating BMs is transformative (Kaur et al., 2023; Ocaña-Tienda et al., 2023), moving beyond conventional qualitative assessment methods to a personalized and time-efficient approach. Artificial intelligence (AI) based volumetric BM assessments will not only improve the precision of measurements but also provide high-quality personalized reports of the individual treatment response of brain metastases and thus influence patient outcomes; they will also help democratize access to high-quality care (Pinto-Coelho, 2023; Najjar, 2023; Tang, 2019). By integrating automated volumetric analysis into clinical practice, we can ensure more reliable and consistent measurements, extending these advanced diagnostic capabilities beyond specialized centers to a broader range of healthcare settings. Improved accessibility of personalized reporting is crucial, particularly for patients in regions where such specialized services were previously unavailable, thus broadening the scope of quality care to include more comprehensive and timely monitoring of disease progression and response to treatment.
The intricate task of accurately detecting, segmenting, and assessing BMs is pivotal for devising effective therapeutic strategies and prognostication. However, the efficacy of machine learning algorithms in this realm is inherently tied to the availability and quality of annotated medical imaging datasets (Zhou et al., 2020; Zhang et al., 2020; Xue et al., 2020; Jeong et al., 2024; Grøvik et al., 2020; Dikici et al., 2020, 2022; Charron et al., 2018; Bousabarah et al., 2020). Historically, the scarcity of large-scale, annotated datasets in the medical imaging field has limited the potential of machine learning algorithms. Many researchers find themselves constrained to smaller, local institutional datasets, which limits algorithm generalizability across different institutions (Greenspan et al., 2016). In this context, medical image analysis challenges (competitions to establish accurate segmentation algorithms) have emerged as crucial platforms, facilitating the development, testing, and benchmarking of machine learning algorithms by providing access to extensive, meticulously labeled, multi-center, real-world datasets.
Specifically, the domain of BM analysis stands to benefit immensely from such collaborative initiatives. The complexities associated with BMs, such as the variability in size, shape, and location of lesions, necessitate sophisticated machine learning approaches that can adapt to the diverse characteristics of these metastatic manifestations (Cho et al., 2021). Moreover, the dynamic nature of BMs, with changes occurring over time and in response to treatment, underscores the need for algorithms capable of longitudinal assessment and multi-lesion segmentation.
The 2023 Brain Tumor Segmentation - Metastases (BraTS-METS) challenge marked a significant shift from previous BraTS challenges, which centered on adult brain diffuse astrocytoma (Zhang et al., 2020; Xue et al., 2020; Jeong et al., 2024). The scope was broadened to encompass a variety of brain tumor entities, thereby addressing the issue of data scarcity and methodological complexities inherent in earlier challenges. This challenge prioritized the segmentation of BMs on pre-treatment MR imaging. The goal of BraTS-METS 2023 was to establish a robust, accurate algorithm for segmenting metastatic lesions of virtually any size on diagnostic magnetic resonance imaging (MRI) using T1-weighted (T1) pre-contrast, T1 post-contrast, T2-weighted (T2), and fluid attenuated inversion recovery (FLAIR) sequences. The resulting standardized auto-segmentation algorithm was made openly accessible, thus facilitating its integration into clinical and research protocols across institutions.
Initially, the intention was to develop an algorithm dedicated to segmenting pre-treatment BMs (Figure 1, Step 1). This algorithm was fine-tuned to delineate the enhancing tumor, peritumoral edema, and necrotic portions of the metastases (Figure 1, Step 2). The ultimate aim was to establish a BM consortium for future collaborative research (Figure 1, Step 3). This consortium is designed to foster a collaborative research environment, not only for the development of BM imaging algorithms but also for their clinical translation and community education efforts.

Background
Standard-of-care evaluation of BMs includes qualitative assessment of changes in lesion size and number, and two-dimensional measurements performed manually by radiologists on a PACS workstation. In clinical trials, the Response Assessment in Neuro-Oncology Brain Metastases (RANO-BM) guidelines predominantly rely on measuring the unidimensional longest diameter of lesions (Lin et al., 2015). However, these traditional criteria may not fully capture the complex dynamics and morphological changes of BMs over time, particularly given the heterogeneity and irregular growth patterns often associated with these lesions.
Recent advances in MRI technology, particularly the adoption of high-resolution 3D sequences such as T1 magnetization prepared rapid acquisition gradient-echo, T1 fast spoiled gradient-echo, and T1 three-dimensional high-resolution inversion recovery-prepared fast spoiled gradient-recalled, have significantly enhanced our ability to detect and monitor smaller BMs. The traditional threshold for target lesions, as outlined in the RANO-BM criteria proposed by Lin et al., set the minimum size at 10 mm in longest diameter, visible on two or more axial slices with a 5 mm or less interval (Lin et al., 2015). With the advancements in imaging, lesions as small as 1-2 mm can now be reliably detected, but because of significant inter-rater variability in the measurement of lesions smaller than 5 mm, the consensus criteria still require a lesion of at least 10 mm to be considered measurable disease. The improved reproducibility and low variability of algorithm-based measurements provide an opportunity for future re-evaluation of standardized assessment criteria to include smaller lesions. Indeed, recent practices have seen a shift towards a 5 mm minimum size threshold, aligning with the capabilities of current MRI technology, as highlighted by Qian et al. (2017).
Integration of automated techniques, such as deep learning algorithms for segmentation and assessment, offers a promising approach to enhance the precision and efficiency of volumetric evaluations, aligning with the requirements of the RANO-BM guidelines (Kanakarajan et al., 2023; Wang et al., 2023a; Yoo et al., 2022). The importance of multi-lesional segmentation and continuous assessment across serial imaging cannot be overstated. Such a comprehensive approach can benefit from the integration of automatic algorithms that are capable of efficiently detecting and segmenting metastases across multiple imaging time points, including pre- and post-treatment scans. The enhanced precision and efficiency of clinical assessments can complement the expertise of radiologists and other clinicians, which would aid not only in tracking disease progression and response to treatment but also in identifying new lesions at the earliest possible stage. Despite the potential benefits, the routine implementation of such automated techniques in clinical settings faces significant hurdles, given the extensive time required and the variability inherent in imaging techniques across different temporal scans. This variability often arises from disparate imaging equipment and the fact that different radiologists may interpret sequential scans for a single patient differently, introducing acquisition heterogeneity and inter-reader variability (Buchner et al., 2023; Mi et al., 2020).
Addressing the detection and segmentation challenges associated with smaller BMs is therefore of paramount importance. Overcoming these challenges through the successful development of targeted algorithms will expedite their translation to and adoption in clinical practice, providing a vital resource in the management of BMs.

Related Works
While challenges remain in the field of automated BM segmentation, recent studies are indicative of a promising trajectory toward achieving high levels of automation, consistency, and adaptability in clinical practice (Jekel et al., 2022b; Kanakarajan et al., 2023; Dang et al., 2022; Jekel et al., 2022a; Chen et al., 2023b). Kanakarajan et al. (2023) demonstrated a significant advancement with their development of a fully automated segmentation method for BMs using T1 contrast-enhanced MR images, which could significantly aid in evaluating treatment effects post-stereotactic radiosurgery. Similarly, Buchner et al. (2023) have identified core MRI sequences that are essential for reliable automatic BM segmentation, providing a foundation for standardized imaging protocols and enhancing algorithmic consistency across various clinical settings. The integration of multi-phase delayed enhanced MR images has been explored by Chen et al. (2023b), who reported improvements in the accuracy of both segmentation and classification of BMs. This approach addressed the critical need for refined diagnostic tools that can adapt to the complex nature of BMs. Furthermore, Ottesen et al. (2023) have extended the capabilities of deep learning algorithms by implementing 2.5D and 3D segmentation techniques on multinational MRI data, enhancing the robustness and adaptability of these systems for diverse clinical environments.
The ongoing development and refinement of these automated segmentation tools are set to revolutionize the way BMs are assessed, bringing about a significant enhancement in the consistency and quality of patient care (Jekel et al., 2022b; Jalalifar et al., 2023). Yoo et al. (2023) underscored the importance of the data domain in self-supervised learning for accurate BM detection and segmentation. This development points toward the creation of more adaptable and robust systems capable of functioning effectively across a variety of clinical scenarios. Moreover, advancements in the reduction of false positives within automated BM segmentation underscore the growing feasibility and effectiveness of these technologies, even in diverse clinical environments, cementing their role as invaluable assets in medical imaging (Ghesu et al., 2022; Liew et al., 2023; Ziyaee et al., 2023).
Detecting smaller metastatic lesions, typically ranging from 1 to 2 mm, is pivotal in patient prognosis and treatment planning. Given the increased reliance on stereotactic radiosurgery (SRS) (Vogelbaum et al., 2022), accurately identifying the exact number and localization of these small metastases becomes even more critical to ensure effective treatment and minimize the risk of missed targets, which could necessitate additional interventions, cause treatment delays, and increase healthcare costs (Minniti et al., 2011; Schnurman et al., 2022; Chen et al., 2023c). The lack of a universally accepted, metastasis-specific AI tool represents a considerable gap in the current landscape, posing a barrier to the standard clinical use of gross tumor volume (GTV) assessment for prognostication in patients with BMs. This challenge is compounded by the lack of a comprehensive public dataset, which would facilitate a fair comparison of existing BM segmentation models. The availability of such a dataset could significantly accelerate progress by enabling researchers to benchmark and refine their models against a standardized dataset, thereby enhancing the reliability and accuracy of AI-powered segmentation tools. Bridging these gaps is essential for advancing the integration of AI in the prognostic evaluation of BMs, ultimately improving patient management and treatment outcomes.

Data
The BraTS-METS dataset included retrospectively collected multiparametric MRI (mpMRI) scans from diverse institutions, representing the variability in imaging protocols and equipment reflective of global clinical practices. Inclusion criteria encompassed MRI scans with the presence of untreated BMs with T1 pre-contrast, T1 post-contrast, T2, and FLAIR sequences. Participating institutions had obtained Institutional Review Board and Data Transfer Agreement approvals before contributing data, ensuring compliance with regulatory standards. Exclusion criteria included the presence of prior treatment changes, lack of one of the required MRI sequences, or imaging not technically acceptable due to motion or other significant imaging artifacts. The cases where post-treatment changes were noted were reserved for BraTS-METS 2024.
The dataset allocation for the BraTS-METS 2023 challenge adhered to the standard machine learning protocol, with 70% designated for training, 10% for validation, and 20% for testing. Ground truth (GT) labels were provided exclusively for the training set, while the validation set remained unlabeled to ensure integrity in algorithmic evaluation. The testing set was kept hidden from the participants. The use of additional data, whether public or private, was restricted to prevent bias in the algorithmic ranking process. Participants were allowed to reference external datasets only for publication purposes and were required to disclose such usage transparently in their manuscripts, along with results derived from the BraTS-METS 2023 dataset.

Imaging Data Description
The mpMRI scans included four sequences: non-enhanced T1, post-gadolinium-contrast T1 (T1Gd), T2, and non-enhanced T2-FLAIR, procured from various scanners and protocols. Standardized pre-processing was applied to all the BraTS-METS mpMRI scans. Specifically, the applied pre-processing routines included conversion of the DICOM files to the NIfTI file format, co-registration to the same anatomical template (SRI24) (Rohlfing et al., 2010), resampling to a uniform isotropic resolution (1 mm³), and, finally, skull stripping (Isensee et al., 2019). The pre-processing pipeline was made publicly available through the Cancer Imaging Phenomics Toolkit (CaPTk) (Pati et al., 2020; Rathore et al., 2018) and the Federated Tumor Segmentation (FeTS) tool (Pati et al., 2022). Conversion to Neuroimaging Informatics Technology Initiative (NIfTI) stripped the accompanying metadata from the Digital Imaging and Communications in Medicine (DICOM) images and removed all protected health information from the DICOM headers. Furthermore, skull stripping mitigated potential facial reconstruction/recognition of the patient (Greenspan et al., 2016; Cho et al., 2021). The specific approach used for skull stripping was based on a novel deep learning approach that accounts for the brain shape prior and was agnostic to the MRI sequence input (Juluru et al., 2020; Schwarz et al., 2019).

Tumor Labels
The annotation of tumor sub-regions aligned with Visually AcceSAble Rembrandt Images (VASARI) feature visibility and encompassed three labels: Gd-enhancing tumor (ET, label 3), surrounding non-enhancing FLAIR hyperintensity (SNFH, label 2), and the non-enhancing tumor core (NETC, label 1). ET is described as the enhancing portion of the tumor, characterized by areas of hyperintensity in T1Gd that are brighter than T1. NETC is identified as the presumed necrotic core of the tumor, which is evident as a non-enhancing focus surrounded by enhancing tumor. SNFH is defined as the peritumoral edema and tumor-infiltrated tissue, indicated by the abnormal hyperintense signal on the T2-FLAIR images, which includes the infiltrative non-enhancing tumor, as well as vasogenic edema in the peritumoral region. In previous BraTS challenges, ET was segmented as label 4. However, starting from BraTS 2023, ET has been segmented as label 3 for consistency. The sub-regions are shown in Figure 2.

Tumor Annotation Protocol
The BraTS initiative, in consultation with domain experts, defined various tumor sub-regions to provide a standardized approach for their assessment and evaluation. However, alternative criteria for delineation could be established, resulting in slightly different tumor sub-regions. To ensure consistency in the GT delineations across various annotators, the following tumor annotation protocol was designed. Structural mpMRI volumes were considered (T1, T1Gd, T2, T2-FLAIR).
The BraTS-METS 2023 challenge focuses on three regions of interest:
1. Whole Tumor (WT) = Label 1 + Label 2 + Label 3
2. Tumor Core (TC) = Label 1 + Label 3
3. Enhancing Tumor (ET) = Label 3
WT describes the complete extent of the disease, encompassing TC and the peritumoral edematous/invaded tissue, typically depicted by the abnormal hyperintense signal in the T2-FLAIR volume. While the radiologic definition of tumor boundaries, especially in infiltrative tumors such as gliomas, presents a well-known challenge, this is less problematic in BMs, which typically have well-defined borders of the contrast-enhancing portion. In most cases, the boundaries of the contrast-enhancing region of the BM and the surrounding FLAIR hyperintense edema are well defined. One of the major challenges in segmenting BMs lies in the overlap of edema between multiple lesions, which is why ET is segmented separately from WT and the two are treated as distinct entities.
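The region definitions above amount to a lookup from label values to binary masks. The following is a minimal sketch of that mapping, assuming the label map has been flattened into a plain Python list; the names `REGION_LABELS` and `region_mask` are illustrative and are not part of the challenge code.

```python
# Label values as given in the text: 1 = NETC, 2 = SNFH, 3 = ET (0 = background).
REGION_LABELS = {
    "WT": {1, 2, 3},  # whole tumor
    "TC": {1, 3},     # tumor core
    "ET": {3},        # enhancing tumor
}


def region_mask(labels, region):
    """Return a binary mask (list of 0/1) for one region of interest."""
    wanted = REGION_LABELS[region]
    return [1 if value in wanted else 0 for value in labels]


# Hypothetical flattened label map, for illustration only.
labels = [0, 1, 2, 3, 3, 0, 2]
wt = region_mask(labels, "WT")
tc = region_mask(labels, "TC")
et = region_mask(labels, "ET")
```

Because the regions are nested (ET within TC within WT), every voxel set in `et` is also set in `tc` and `wt`.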

Annotation Pipeline
To ensure uniformity in data imaging and tumor labeling, we established a comprehensive annotation pipeline (Figure 3). This pipeline facilitates the development of accurate GT labels and is divided into five key stages: pre-segmentation, annotation refinement, technical quality control (QC), initial approval, and final approval.

Pre-segmentation
The initial phase involved pre-segmenting imaging volumes using three distinct approaches: 2. nnU-Net trained on AURORA multicenter study (Kaur et al., 2023), which creates SNFH and tumor core (ET + NETC) labels.
The label fusion process varied for each label. SNFH (label 2) was fused using the STAPLE fusion algorithm, which aggregates the segmentations from each automated segmentation algorithm while accounting for systematic errors (Warfield et al., 2004). ET (label 3) was fused using the minority voting algorithm to aggregate all enhancing tumor voxels identified by the automated segmentation algorithms, due to varying accuracies in detecting small metastases. NETC (label 1) is only produced by the nnU-Net trained on UCSF-BMSR. Algorithms trained on AURORA and Heidelberg datasets only segment TC and SNFH. Therefore, NETC overlays both ET and SNFH labels.
[Figure 3: annotation pipeline. Approver I: board-certified attending neuroradiologist. Technical QC: removing random voxels and voxels outside the brain mask; ensuring all images have the same parameters as the SRI24 atlas; verifying that all segmentations and masks are in the folder with the original NIfTI images. Approver II: secondary review by a different board-certified neuroradiologist from the approver pool.]
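The ET fusion rule described above can be illustrated with a small voting function. This is a sketch under one stated assumption: "minority voting" is read here as keeping a voxel when at least one model marks it (`min_votes=1`), which matches the stated intent of aggregating all enhancing voxels. The function name `fuse_by_votes` and the toy masks are hypothetical, and the STAPLE fusion used for SNFH is not reproduced.

```python
def fuse_by_votes(masks, min_votes=1):
    """Fuse binary masks voxel by voxel; keep voxels with at least min_votes votes."""
    if not masks:
        raise ValueError("need at least one mask")
    length = len(masks[0])
    if any(len(mask) != length for mask in masks):
        raise ValueError("all masks must have the same number of voxels")
    return [1 if sum(votes) >= min_votes else 0 for votes in zip(*masks)]


# Hypothetical flattened ET masks from three pre-segmentation models.
model_a = [0, 1, 1, 0, 0]
model_b = [0, 1, 0, 0, 0]
model_c = [0, 1, 0, 1, 0]

# Assumed reading of minority voting: one vote is enough.
et_fused = fuse_by_votes([model_a, model_b, model_c], min_votes=1)
# Majority voting, shown only for contrast.
et_majority = fuse_by_votes([model_a, model_b, model_c], min_votes=2)
```

In this toy case the majority rule drops the two voxels that only a single model detected, which is the behavior the more permissive rule avoids for small metastases.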

Annotation Refinement and Initial Approval
All pre-segmentations from the three models, along with fused segmentations, were provided to the annotators. Subtraction images, in which the non-contrast T1 sequence is digitally subtracted from the post-contrast T1 sequence, were also provided to aid in the annotation refinement process. Annotations were performed by a diverse group of more than 150 student annotators and volunteer neuroradiology experts, under the supervision of annotator coordinators (A.J. and K.K.). Cases requiring re-annotation due to incompleteness were identified and returned for correction.
During the process of annotation, the trainees participated in group reviews of cases, asked questions, and attended lectures by expert imagers. Completed student annotations were then reviewed by a pool of 52 experienced board-certified attending neuroradiologists (approvers) recruited by the American Society of Neuroradiology, ensuring quality control and uniformity with the SRI24 atlas standards.
Approvers reviewed the volunteer annotations and either approved the case or returned it to students for re-annotation. Additionally, a QC process was implemented, which included removing all random voxels and any voxels outside the brain mask, ensuring all images had the same parameters (space, orientation, and origin) as the SRI24 atlas, and verifying that all segmentations and segmentation masks were present in the folder with the original NIfTI images.

Annotation Final Approval
Following refinement, each case underwent a secondary review by a different board-certified neuroradiologist from the approver pool, ensuring accurate metastasis segmentation and adherence to inclusion criteria. In cases of discrepancy, the second approvers made the necessary changes themselves without reverting to the trainees. Finally, a neuroradiologist (M.A.) with over 6 years of brain tumor expertise conducted a final dataset review, guaranteeing consistency across all annotations.

Common Errors of Automated Segmentations
Based on observations from previous BraTS challenges, common errors in automated segmentations were identified. The most typical errors in the current challenge included:
1. Automated algorithms missing small metastases. Enhancing metastasis was fused using the minority voting algorithm to aggregate all enhancing tumor voxels identified by the three algorithms. However, many small metastases were missed and were manually segmented by neuroradiology attendings.
2. Segmentation of white matter changes from microvascular disease. Peritumoral edema segmentations were checked by neuroradiology attendings and modified.
3. The segmentation of non-enhancing lesions that have intrinsic T1 hyperintensity. Voxels with intrinsic T1 hyperintensity were manually removed from ET segmentations.
These insights led to specific adjustments in the annotation process to enhance accuracy.

Performance Evaluation Framework
Participants were offered a baseline approach implemented in the Generally Nuanced Deep Learning Framework (GaNDLF), a modular open-source framework maintained by the MLCommons organization. GaNDLF provides popular network architectures, but also allows users to leverage the functionality of other libraries, such as Pillow and MONAI. Submissions were packaged in MLCube containers as described in the instructions provided on the Synapse platform. These submissions were registered to MLCommons' MedPerf, an open federated AI/ML evaluation platform. MedPerf automated the pipeline of running the participants' models on each contributing site's evaluation data and calculating evaluation metrics on the resulting predictions. Finally, the Synapse platform retrieved the metrics results from the MedPerf server and ranked them to determine the winner.
Performance evaluation was based on Dice scores and 95% Hausdorff distance (HD95) for individual segmented lesions as defined by the three regions of interest: ET, TC and WT. Given that BMs are often small, sometimes comprising only a few voxels, it was clinically important to assess segmentation algorithms based on their capacity to accurately detect and delineate both small and large lesions. Teams were ranked based on a combination of lesion-wise Dice and Hausdorff distance scores across all evaluated test cases. False positives and false negatives were rigorously penalized, receiving a score of 0 for Dice and a fixed penalty of 374 for HD95. This approach was uniformly applied across the three designated tissue classes, with subsequent aggregation of results by taking the mean score for each CaseID within each tissue category.

Lesion-wise Dice

$$\text{Lesion-wise Dice} = \frac{\sum_{i=1}^{L} \text{Dice}(l_i)}{TP + FP + FN}$$

where L is the total number of GT lesions and TP, FP, and FN are the number of true positive, false positive, and false negative lesions, respectively.
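The penalty scheme described in this section (0 for Dice and 374 for HD95 on every false positive and false negative lesion, followed by a per-case mean) can be sketched as a single averaging function. This is an illustrative sketch, not the challenge's evaluation code: the function name, the argument layout, and the choice to return `None` for a case with no lesions and no predictions are assumptions.

```python
DICE_PENALTY = 0.0    # score assigned to each FP or FN lesion for Dice
HD95_PENALTY = 374.0  # fixed value assigned to each FP or FN lesion for HD95


def lesionwise_mean(tp_scores, n_fp, n_fn, penalty):
    """Mean per-lesion score for one case, counting each FP and FN as `penalty`."""
    n_total = len(tp_scores) + n_fp + n_fn
    if n_total == 0:
        return None  # empty case: left undefined in this sketch
    return (sum(tp_scores) + penalty * (n_fp + n_fn)) / n_total


# Hypothetical case: two detected lesions, one spurious prediction, one missed lesion.
case_dice = lesionwise_mean([0.9, 0.7], n_fp=1, n_fn=1, penalty=DICE_PENALTY)
case_hd95 = lesionwise_mean([2.0, 4.0], n_fp=1, n_fn=1, penalty=HD95_PENALTY)
```

The example shows why missed or spurious lesions dominate the HD95 term: a single penalized lesion contributes far more than any well-segmented one.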
All participants were evaluated and ranked using the same unseen testing data, which was not accessible to them. They were required to upload their containerized method to the evaluation platforms. The final top-ranked teams were announced at the 2023 Medical Image Computing and Computer Assisted Intervention Society (MICCAI) annual meeting, with monetary prizes awarded to the top-ranked teams in both tasks of the challenge.
For this challenge, each team was ranked relative to its competitors for each of the testing subjects, for each evaluated region (i.e., ET, TC, WT), and for each measure (i.e., Dice and Hausdorff). For example, each team was ranked for 59 subjects, for 3 regions, and for 2 metrics, which resulted in 59 × 3 × 2 = 354 individual rankings. The final ranking score (FRS) for each team was then calculated by first averaging across all these individual rankings for each patient (i.e., cumulative rank), and then averaging these cumulative ranks across all patients for each participating team. This ranking scheme has also been adopted in other challenges with satisfactory results, such as the Ischemic Stroke Lesion Segmentation challenge (Maier et al., 2017).
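The two-stage averaging behind the FRS can be made concrete with a short sketch. The data layout (one dictionary per subject, keyed by region and metric, holding one value per team), the function names, and the handling of ties by shared mean rank are assumptions made for illustration; the text does not specify how ties were resolved.

```python
def rank_values(values, higher_is_better):
    """Rank teams for one (subject, region, metric); 1 is best, ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def final_ranking_scores(subjects):
    """Return (FRS per team, cumulative rank per subject and team); lower is better."""
    n_teams = len(next(iter(subjects[0].values())))
    cumulative = []
    for per_subject in subjects:
        totals = [0.0] * n_teams
        for (_region, metric), values in per_subject.items():
            # Dice: higher is better. HD95: lower is better.
            ranks = rank_values(values, higher_is_better=(metric == "dice"))
            totals = [t + r for t, r in zip(totals, ranks)]
        cumulative.append([t / len(per_subject) for t in totals])
    frs = [sum(c[team] for c in cumulative) / len(cumulative) for team in range(n_teams)]
    return frs, cumulative


# Hypothetical scores for three teams on two subjects (one region shown for brevity).
subjects = [
    {("ET", "dice"): [0.8, 0.6, 0.7], ("ET", "hd95"): [5.0, 374.0, 10.0]},
    {("ET", "dice"): [0.5, 0.9, 0.5], ("ET", "hd95"): [20.0, 3.0, 8.0]},
]
frs, cumulative = final_ranking_scores(subjects)
```

With the full challenge layout, each subject dictionary would hold 3 × 2 = 6 entries, giving the 354 individual rankings per team over 59 subjects mentioned above.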
We then conducted further permutation testing to determine the statistical significance of the relative rankings between each pair of teams. This permutation testing identified differences in performance that exceeded those that might be expected by chance. Specifically, for each team, we started with a list of observed subject-level cumulative ranks, i.e., the actual ranking described above. For each pair of teams, we repeatedly (100,000 times) randomly permuted the cumulative ranks for each subject. For each permutation, we calculated the difference in the FRS between this pair of teams. The proportion of times the difference in FRS calculated using randomly permuted data exceeded the observed difference in FRS (i.e., using the actual data) indicated the statistical significance of their relative rankings as a p-value. These values were reported in an upper triangular matrix, providing insight into statistically significant differences across each pair of participating teams.
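The pairwise permutation test can be sketched as follows. Several details are assumptions rather than facts from the text: permuting "for each subject" is implemented as swapping the two teams' cumulative ranks within a subject with probability 0.5, the comparison is one-sided, and permutations whose difference is at least as large as the observed one are counted (so that identical teams yield a p-value of 1 rather than 0). The function name and the toy ranks are hypothetical.

```python
import random


def permutation_p_value(ranks_a, ranks_b, n_permutations=100_000, seed=0):
    """One-sided paired permutation test on subject-level cumulative ranks.

    ranks_a is assumed to belong to the better-ranked team (lower FRS).
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both teams need one cumulative rank per subject")
    rng = random.Random(seed)
    n = len(ranks_a)
    # Observed difference in FRS (team B minus team A).
    observed = sum(b - a for a, b in zip(ranks_a, ranks_b)) / n
    hits = 0
    for _ in range(n_permutations):
        diff = 0.0
        for a, b in zip(ranks_a, ranks_b):
            if rng.random() < 0.5:  # swap the two teams' ranks for this subject
                a, b = b, a
            diff += b - a
        if diff / n >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical cumulative ranks for two teams on six subjects.
team_a = [1.5, 2.0, 1.0, 2.5, 1.5, 2.0]
team_b = [2.5, 2.0, 3.0, 3.5, 2.0, 2.5]
p_value = permutation_p_value(team_a, team_b, n_permutations=10_000)
```

Running this for every pair of teams and storing the results for pairs (i, j) with i < j would give an upper triangular matrix of the kind described above.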

Analysis
The competition framework encompassed evaluations across three key regions: ET, TC, and WT, utilizing two primary metrics: lesion-wise Dice and lesion-wise HD95. These metrics have been developed primarily to evaluate the performance of models at the level of individual lesions, rather than on a whole-image basis. This approach ensured that our evaluation did not favor models that only captured large lesions, a limitation commonly observed with standard Dice scores. By assessing models on a lesion-by-lesion basis, we gained insights into their ability to segment all sizes of BMs accurately.
To implement this evaluation framework, we first isolated the lesion tissues (i.e., ET, TC, WT). We applied dilation to the GT labels for WT, TC, and ET to gauge the lesion's extent. This technique ensured that during connected component analysis, small lesions adjacent to a primary lesion were not misclassified as separate entities. It is crucial to note that the GT labels remained unchanged throughout this process. We conducted a 26-connectivity connected component analysis on the predicted labels and compared each component to the corresponding GT label on a component-by-component basis. We calculated the Dice scores and HD95 scores individually for each lesion (or component), assigning the aforementioned penalty to all false positives and false negatives. Subsequently, we computed the mean score for each specific case.
Acknowledging the variability in lesion significance arising from human error, an expert panel of clinical radiologists established a volumetric threshold of 2 voxels (2 mm³), below which the models' performance on lesions deemed "small/false" was not considered in the evaluation. This approach was primarily adopted to ensure that participants were not unfairly penalized for stray voxels in the GT labels, which may result from human error, or for small lesions unrelated to the pathology central to the challenge. The expert panel of clinical radiologists also determined the dilation factor, which was uniformly applied for combining lesions in the GT masks. A dilation factor of 1 voxel in 3D space was chosen because BMs can be small, and it is important to avoid combining these small BMs.
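The grouping procedure described above (dilate the GT mask by one voxel, find 26-connected components, keep the original voxels, and ignore very small lesions) can be sketched with the standard library alone. This is an illustration, not the released evaluation code: the representation of a mask as a set of (x, y, z) tuples, the use of the full 26-neighborhood as the dilation element, and treating the threshold as "fewer than `min_voxels` voxels are ignored" are all assumptions.

```python
from collections import deque
from itertools import product

# The 26 neighbor offsets of a voxel in 3D (every offset except (0, 0, 0)).
NEIGHBORS_26 = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def connected_components(voxels):
    """Group a set of (x, y, z) foreground voxels into 26-connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS_26:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def dilate(voxels):
    """Dilate a voxel set by one voxel (assumed 26-neighborhood element)."""
    grown = set(voxels)
    for x, y, z in voxels:
        for dx, dy, dz in NEIGHBORS_26:
            grown.add((x + dx, y + dy, z + dz))
    return grown


def group_lesions(voxels, min_voxels=2):
    """Group GT voxels into lesions via the dilated mask; the GT voxels stay unchanged."""
    original = set(voxels)
    lesions = []
    for grown in connected_components(dilate(original)):
        lesion = grown & original  # dilation only decides grouping
        if len(lesion) >= min_voxels:  # assumed reading of the 2-voxel threshold
            lesions.append(lesion)
    return lesions


# Hypothetical GT: a small lesion, a fragment one voxel away, and a stray voxel.
gt = {(0, 0, 0), (1, 1, 1), (3, 1, 1), (10, 10, 10)}
raw_components = connected_components(gt)
lesions = group_lesions(gt)
```

In the toy example the fragment at (3, 1, 1) is merged with its neighbor because their dilated masks touch, while the isolated voxel falls below the size threshold and is ignored.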
The code and detailed information on the lesion-wise evaluation metrics can be found here (footnote 1).

Dataset
Multiple datasets were contributed by individual institutions and were in various stages of annotation and approval (Figure 4).

Dataset Sources
Our annotation and approval pipeline, as previously described, was applied to datasets from a variety of institutions, including New York University (NYU), Yale University, Washington University, Cairo University (CairoU), Duke University, and the University of Missouri. The annotated NYU dataset is uniquely hosted on the NYU website (footnote 2), separate from the public BraTS repository. As for the UCSF dataset, synthetic T2 images were generated and shared on the UCSF website (footnote 3). The Stanford University dataset, despite being publicly available, was not incorporated into our primary dataset due to the lack of T2 image sequences. These datasets were available and optional for additional training. For logistical reasons, the UCSF, Stanford, and NYU datasets were excluded from the validation and test phases of our project.
In all, 2712 cases were received from various institutes (Table 2). All the source institutions were located in the United States, except for one in Egypt.

Lesion Characteristics
Table 3 provides a detailed overview of lesion count and sizes across the different dataset groups used in the BraTS-METS 2023 challenge. These data demonstrate the variation in lesion count and size across the dataset groups.

Performance Analysis
Table 4 provides the relative ranking for each team. Team NVAUTO ranked first in the challenge, with an average rank across subjects of 7.9 and a PatientWise mean of 0.38. Team SY placed second with a PatientWise mean of 0.41 across all patients. The supplementary material depicts the pitfall cases with figures illustrating the false positives or missed lesions.
Figure 5 provides a patient-wise comparison of segmentation accuracy across the different participating teams. The boxplots reflect the distribution of each team's accuracy per patient case, per lesion, across all cases within the test dataset, with lower values signifying better performance. The teams NVAUTO, SY, and blackbean showed a notably higher median accuracy, alongside a relatively narrow interquartile range (IQR). Conversely, DeepRadOnc displayed a wider IQR.
A description of the algorithms used by the top four winning teams is shown in Table 5.

Detailed Performance by Tumor Entities
Table 6 delineates the comparative performance of each participating team's Dice scores for each tumor entity (i.e., ET, TC, and WT). The team NVAUTO secured the top rank across all categories, exhibiting a mean Dice score of 0.60 for ET, 0.65 for TC, and 0.62 for WT. Notably, SY and blackbean shared the second rank in the ET segmentation, with a mean of 0.57. Figures 6, 7, and 8 further highlight the lesion-wise Dice scores (shown as panels A) and HD95 (shown as panels B) for each participating team for each tumor entity.
Figure 9 illustrates a comparative evaluation across the three tumor regions of interest, where the performance of the segmentation models is quantified using three metrics: lesion detection rate, sensitivity, and positive predictive value (PPV). The lesion detection rate was led by NVAUTO with rates of 76% for ET, 78% for TC, and 80% for WT. Closely following were blackbean and SY, with both achieving a 75% detection rate for ET and TC, and 76% and 72% for WT, respectively. In terms of sensitivity, NVAUTO again showed superior performance, with 90% for ET, 91% for TC, and 90% for WT, reflecting a high true positive rate. blackbean and SY exhibited comparably high sensitivity, around 89-90% across tumor entities. PPV results depicted NVAUTO at the forefront with 82% for ET, 84% for TC, and 84% for WT. Following suit, blackbean maintained a PPV of 79% across all tumor entities, and SY showed a slightly lower yet robust PPV of 76%.
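For readers less familiar with these three metrics, the sketch below gives their standard count-based definitions. It is an illustration only: this section does not state whether the challenge computed sensitivity and PPV from voxel counts or lesion counts, and the function names are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate, TP / (TP + FN); NaN if the reference has no positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else float("nan")


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value, TP / (TP + FP); NaN if nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")


def detection_rate(detected_lesions: int, total_gt_lesions: int) -> float:
    """Fraction of ground-truth lesions overlapped by at least one prediction."""
    if total_gt_lesions == 0:
        return float("nan")
    return detected_lesions / total_gt_lesions
```

Under these definitions, a model that finds 3 of 4 reference lesions has a detection rate of 0.75 regardless of how well each detected lesion is contoured, which is why detection rate and Dice are reported separately.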

Algorithm Sensitivity to Lesion Size
Figure 10 provides insight into the models' performance in segmenting lesions of different sizes. This was analyzed by calculating a running average within an expanding window of tumor volume, starting with only the smallest tumors and progressively including larger lesions (Kelahan et al., 2022).
The graphs collectively indicate that segmentation algorithm performance diminishes as tumor size decreases, with all teams facing challenges in maintaining high Dice scores and lesion detection rates for smaller tumors. The HD95 data suggest that algorithms struggled with precision in delineating the contours of smaller lesions, reflected in greater distances from the ground truth, a trend particularly noticeable for tumors less than 100 mm³ in volume. Despite these challenges, NVAUTO consistently outperformed its counterparts.
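The expanding-window running average used for this analysis can be sketched as follows. This is a minimal illustration assuming one score and one volume per lesion; the function name `expanding_mean_by_volume` and the input layout are hypothetical and not taken from the challenge code.

```python
import numpy as np


def expanding_mean_by_volume(volumes_mm3, scores):
    """Running mean of per-lesion scores as larger lesions are progressively included.

    Returns the sorted volumes and, for each position k, the mean score of the
    k smallest lesions.
    """
    volumes = np.asarray(volumes_mm3, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(volumes, kind="stable")  # smallest lesions first
    sorted_volumes = volumes[order]
    running_mean = np.cumsum(scores[order]) / np.arange(1, len(order) + 1)
    return sorted_volumes, running_mean


# Toy example with made-up values: three lesions of 500, 20, and 100 mm^3.
vols, means = expanding_mean_by_volume([500, 20, 100], [0.9, 0.1, 0.5])
```

Plotting `means` against `vols` gives a curve of the kind described for Figure 10: the left end reflects only the smallest lesions, and the right end converges to the overall mean.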

Discussion
The use of machine learning in medical imaging has brought notable improvements in detecting and segmenting BMs. Clinical evaluation of BMs has unique complexity because it requires volumetric measurements and organization of lesions to provide granular details on individual lesion treatment history and assess treatment response. Presence of BMs is often a prognostic indicator of poor outcome in patients with metastatic disease, significantly changing treatment options and impacting patient survival (Jekel et al., 2022a; Chen et al., 2023b; Ottesen et al., 2023). The 2023 BraTS-METS challenge has significantly driven forward the development of algorithms designed to manage the complex task of BM segmentation. These algorithms provide clinicians with better tools to measure tumor volumes accurately, which is crucial for both treatment planning and patient outcomes. The varying performance among the participating teams underlines the inherent complexity of tumor segmentation in diverse datasets. This diversity in results particularly highlights the difficulty algorithms face in consistently identifying and accurately segmenting small metastases, which remain a significant hurdle in the literature, in clinical practice, and for BraTS-METS challenge participants. The assessment metric used in the BraTS-METS 2023 challenge penalizes false negatives and false positives, which yields low overall Dice coefficients but favors the selection of algorithms that can be readily translated into diverse clinical practices. The performance trends observed in the challenge demonstrate that while some progress has been made, the precise detection of small metastases continues to be the principal challenge, limiting the overall effectiveness of current models. Enhancing the sensitivity and specificity of these models for small lesion detection is crucial, as this would lead to significant improvements in diagnostic accuracy and clinical outcomes.

Table 5 (fragment; the team is not identified in the extracted text):
• nnU-Net and Swin UNETR CNN + ViT
• Outputs of these networks are then subjected to a non-linear function
• Processed outputs are combined through model ensembling to create ensembled predictions
• Label-wise post-processing is then applied to these ensembled predictions to produce the final predictions for each label
While multiple algorithms have shown promise in accurately segmenting BMs with high Dice scores (Dikici et al., 2020, 2022; Charron et al., 2018; Bousabarah et al., 2020), a critical limitation remains in their ability to detect very small lesions, i.e., under 5 mm in size. Accurately identifying and quantifying every lesion, regardless of size, is paramount for effective therapeutic planning and prognosis assessment. [...] were missed on initial MRIs, despite meeting diagnostic criteria, but were detected on subsequent imaging in patients undergoing repeat SRS courses (Fairchild et al., 2024). The radiographic evidence of these metastases could often be spotted in earlier scans, suggesting potential for improved early detection and treatment planning. This issue is particularly pronounced for lesions under 3 mm, which may go untreated initially, only to become apparent on future imaging (Fairchild et al., 2023).
The heterogeneity in the appearance of BMs, ranging from multiple small lesions to solitary large lesions with varying degrees of edema, presents unique challenges in their detection and management. Our review of the challenge outcomes shows that Team NVAUTO achieved the highest scores, with a mean lesion-wise Dice score of 0.60 to 0.65 across different tumor entities. While these results place them at the forefront, the scores also highlight that there is considerable potential for further advancements. The close performance of teams like SY and blackbean illustrates the competitive nature of the field and emphasizes the need for ongoing improvements in precision, especially for smaller and more challenging lesions.
It is essential to highlight how various models developed for the 2023 BraTS-METS challenge handled the segmentation of these critical, small lesions. Our analysis of model performance across different lesion sizes revealed significant variations in how these models managed lesion detection and characterization. For instance, NVAUTO exhibited exceptional performance across all lesion sizes, particularly with smaller lesions, surpassing the overall performance of many other models in the challenge. These model performance findings underscore the necessity for continuous improvement in the algorithms' sensitivity to tumor size variations, which is crucial for ensuring that all lesions, particularly the smaller and potentially more elusive ones, are accurately identified and appropriately managed in clinical settings.
In the realm of targeted therapies, such as radiation, precision in lesion segmentation directly influences treatment efficacy, as the measured lesion size determines the SRS dose. For example, lesions up to 20 mm may receive up to 24 Gy, which is adjusted based on the lesion's diameter to prevent severe neurotoxicity (Shaw et al., 2000). Misidentifying or overlooking even a single small lesion can lead to inadequate treatment coverage, potentially resulting in suboptimal patient outcomes and increased recurrence rates (Kaal et al., 2005; Zindler et al., 2014). This underscores the necessity for advancements in diagnostic imaging techniques and highlights the critical role of machine learning technologies in achieving high precision in BM detection and segmentation. In turn, these algorithms have the potential to significantly impact treatment response assessments and improve workflow efficiencies in clinical practice.
Accurate detection and precise quantification of lesion volumes are critical for determining patient prognosis. Prior research has shown that the gross tumor volume (GTV) of metastatic disease within the brain significantly impacts patient survival, particularly when deciding between equivalent treatment options such as surgery and radiotherapy (Routman et al., 2018; Krist et al., 2022). This precise volume measurement helps clinicians choose the most appropriate therapeutic approach, ensuring that treatments like SRS or invasive surgical interventions are tailored to the patient's specific disease burden.
The ability to assess the GTV of BMs at diagnosis is crucial for patient outcomes. Accurately tracking changes in lesion volumes and perilesional edema over time is essential for informed decision-making in the post-treatment setting (Jalalifar et al., 2023). Treatments for brain metastatic disease utilize targeted approaches such as SRS, hypofractionated stereotactic radiation therapy (HFSRT), and hippocampal avoidance whole brain radiotherapy, with less common use of whole brain radiation therapy (WBRT) due to neurotoxicity concerns. These techniques are particularly beneficial for patients with multiple metastases (even more than 50) and rely heavily on precise volumetric localization of each metastasis (Simon et al., 2022). Unlike WBRT, which uses a 2D plan and does not require detailed localization, SRS and HFSRT involve complex 3D planning to accurately target each lesion. Furthermore, the dynamic nature of these metastases, with some increasing in size transiently before decreasing or resolving, and others possibly representing radiation necrosis or recurrence, underscores the necessity for reliable monitoring of metastasis sizes in relation to treatment timing (Wang et al., 2023a). This ongoing surveillance of the contrast-enhancing component and peritumoral edema is vital to differentiate between active disease and treatment effects, thereby guiding the adjustment of therapeutic strategies (Kaur et al., 2023; Jekel et al., 2022a).
A significant challenge in creating large open science datasets involves safeguarding patient privacy and securing sensitive data (Vahdati et al., 2024; Shaw et al., 2024; Wang et al., 2024; Gichoya et al., 2023; Davis et al., 2024). This can be addressed by establishing robust security measures, such as data de-identification using skull and face stripping from the MRI scan to remove facial features. Moreover, fostering a culture of sharing and collaboration is essential for the broad applicability of these algorithms across different institutions. It is vital to balance promoting open science with maintaining patient safety, as this balance will drive future advancements in medical image analysis. This focus on open science not only broadens access to data but also introduces challenges in data handling and annotation, particularly for complex cases like BMs.
In the inaugural 2023 BraTS-METS challenge, a significant hurdle was the preparation of BM datasets with expert-approved lesion annotations. Unlike other brain tumors such as glioblastomas or meningiomas, BMs display significant phenotypic variability and are often characterized by the presence of multiple synchronous lesions. This variability and multiplicity greatly complicate the annotation process, extending the time required from a few minutes to several hours depending on the number and complexity of lesions.
To address this, we introduced an innovative educational approach to annotation that not only facilitates the development of high-quality annotated datasets but also serves as a learning platform for annotators. This strategy involves a comprehensive educational series on BM imaging, basic MRI physics, and the principles of open science. This approach emphasizes deliberate learning (Mitchell and Boyer, 2020), where student annotators engage deeply with the material through practical experience, reinforced by weekly hands-on sessions with experts in brain tumor imaging and a structured curriculum. This method not only accelerates the learning curve but also ingrains a thorough comprehension of diverse BM presentations, turning the annotation process into a valuable educational experience and creating a rich training resource for future professionals. Additionally, the curriculum includes detailed discussions on various brain abnormalities such as microvascular white matter damage, microbleeds, and different stages of hemorrhage, further enriching the annotators' understanding and capabilities in annotating complex imaging datasets.
While our approach faced challenges due to the heterogeneity of the contributed datasets, this diversity is reflective of real-world clinical environments where algorithms must perform effectively across a wide range of data variations. Many cases were excluded from the analysis due to resection cavities, post-treatment changes, or the absence of brain parenchymal metastases. Inadequate skull stripping sometimes led to the inadvertent removal of metastases or failure to detect them, complicating accurate data interpretation. Furthermore, skull stripping can make it difficult to describe and differentiate dural-based lesions, such as metastases and meningiomas, and limits the evaluation of osseous metastases to the calvarium.
Another source of heterogeneity was due to differences in data acquisition, patient motion, protocols, slice thickness, and contrast injection timing that can lead to misregistration of images on different sequences. In particular, the impact of slice thickness on lesion detectability is crucial, especially when targeting subcentimeter metastases. For example, the RANO high grade glioma criteria specify lesion visibility on two contiguous 5 mm thick slices, underscoring the importance of image resolution (Wen et al., 2023). During our manual segmentation processes, challenges arose when matching sequences acquired with varying 2D and 3D techniques, highlighting disparities in slice thickness and voxel sizes. In some instances, the co-registration of images appeared misaligned, potentially affecting the precision of segmentations. To address some of these issues, all images were standardized by registering them to the common SRI24 atlas (Rohlfing et al., 2010), promoting greater uniformity and adherence to the consensus brain tumor imaging protocol. This not only helped to mitigate the variations introduced by different imaging protocols but also enhanced the general applicability and effectiveness of the developed algorithms. These limitations contribute to the heterogeneity of data, which can have both positive and negative implications. While it can pose challenges for developing a uniform segmentation algorithm, it can also provide a diverse range of data that can benefit and generalize algorithm development.
Although standardized brain tumor imaging protocols (BTIP) have been proposed and are increasingly used in clinical trials, resulting in improved standardization of image acquisition, there is still significant variability in imaging protocols among different imaging practices (Ellingson et al., 2021, 2015; Kaufmann et al., 2020). Increased implementation of standardized imaging protocols ensures consistency in the acquisition and interpretation of neuro-oncological images, which is crucial for comparing outcomes across studies and improving the reliability of lesion measurement across different institutions.
The complexity of annotating ground truth data for BMs represents yet another challenge in this year's BraTS-METS challenge, largely due to the typically small size of BMs and their frequent occurrence in large numbers within a single scan. Annotator fatigue is a notable concern, as the meticulous nature of the task can lead to errors or oversight. Throughout the annotation process, numerous instances necessitated segmentation revisions, as exemplified by the initial work done on the Yale BM dataset by a medical student, which later required refinement by experienced neuroradiologists (Kaur et al., 2023; Cassinelli Petersen et al., 2022; Jekel et al., 2022a; Ramakrishnan et al., 2023). The need for such revisions became particularly apparent when the dataset, along with its segmentations, was integrated into the BraTS challenge and adapted to a new atlas. This process often revealed previously unnoticed small lesions or inaccuracies in the depiction of necrotic tumor portions and peritumoral edema on FLAIR images. These experiences showcase the imperative of a robust ground truth (i.e., reference standard) approach that incorporates human-in-the-loop refinements and utilizes consensus techniques like STAPLE to ensure the highest data integrity (Warfield et al., 2004). The iterative nature of these annotations underscores the need for multiple rounds of review to ensure accuracy and the importance of standardizing annotation practices to facilitate more efficient data usage. To foster continual improvement and address any discrepancies, we encourage participants to engage actively with the challenge organizers, who are prepared to update and refine the segmentation data as necessary to maintain the integrity and utility of the dataset.

Conclusion
In the inaugural 2023 BraTS-METS challenge, we have addressed both technical and practical challenges in the establishment of datasets, high quality reference standard annotations, and assessment metrics for the development and application of machine learning algorithms for BM segmentation by challenge participants. The challenge has highlighted the critical need for algorithms capable of detecting even the smallest lesions, which are often overlooked due to human error or obscured by the limitations of imaging data. This task is complicated by the necessity of balancing the high sensitivity required for detection with the need to minimize false positives that can disrupt clinical workflows. The development of refined segmentation algorithms that effectively balance sensitivity with specificity is therefore essential. Utilizing multi-institutional datasets, the BraTS-METS challenge has been instrumental in advancing these developments, pushing forward the creation of models that are robust and adaptable across varied clinical environments. This approach optimizes the precision of these algorithms and potentiates their practical applicability, ensuring they can meet the nuanced demands of real-world medical practice. As we continue to refine these technologies, our goal remains to enhance the accuracy of diagnoses and treatment planning, ultimately improving patient management and outcomes in the challenging arena of brain metastasis treatment.

Step 1 :
Figure 1: Flow chart outlining the BraTS-METS 2023 vision, beginning with the pre-treatment BM segmentation during the 2023 ASNR/MICCAI BraTS challenge. In this phase, segmentations were conducted on a select dataset subset to refine the dataset for algorithm development by participants. The dataset is set to expand in subsequent challenges through ongoing annotation of contributed brain MRIs. Future challenges will incorporate datasets with annotated post-treatment BMs, segmentations including the hemorrhagic component of tumors, and non-skull-stripped images to enhance the evaluation of dural-based and osseous metastases. These datasets, coupled with clinical data and patient demographics, will contribute to an inter-institutional BM consortium, fostering collaborative research and the clinical application of algorithms through partnerships between academia and industry.
1. nnU-Net trained on the University of California, San Francisco BMs Stereotactic Radiosurgery (UCSF-BMSR) MRI Dataset (Rudie et al., 2024), which creates the ET label and was fused with predictions of NETC and SNFH from an nnU-Net trained on the pre-treatment BraTS 2021 glioma dataset.

Figure 2: Image panels illustrating the annotated tumor sub-regions across various mpMRI scans with segmentations of ET (yellow), SNFH (green), and NETC (red) done on ITK-SNAP.

Figure 4: Map of institutions that expressed interest in contributing data to the BraTS-METS challenge.

Figure 5: BraTS-METS 2023 boxplots of LesionWise ranking across patients for all participating teams on the BraTS 2023 test set (lower is better).

Figure 10: BraTS-METS 2023 plot of cumulative average of (A) Dice scores, (B) 95% Hausdorff distance (HD95), and (C) lesion detection rate as a function of increasing lesion volume.

Figure 11: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Figure 13: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Figure 15: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Figure 17: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Figure 19: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Figure 21: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Figure 23: Supplementary: Examples of Random Voxels Predicted as Non-enhancing tumor core

Table 1: Overview of publicly available datasets for BMs.

Table 2: Dataset sources in the BraTS-METS 2023 challenge. In the training dataset, 474 cases from UCSF and Stanford were included as optional because they did not have original T2 weighted images.
* The NYU dataset is part of the official challenge. Because it is hosted on a separate website, it is not included in the validation or test set.
^ UCSF and Stanford datasets are not part of the official challenge. Both datasets are provided as optional training sets.

Table 3: Lesion count and sizes for each dataset group.

Table 4: Top-performing teams ranking with cumulative ranks across subjects. Lower scores indicate better performance.

Table 5: Description of algorithms used by the top 4 winning teams.

Table 6: Teams' Dice scores, reported as mean ± standard deviation (median), and ranking based on individual tumor entities.